-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: allow 1D numpy array #1185
base: main
Are you sure you want to change the base?
Conversation
Hey, thank you for your contribution! Would you be able to add a test here: integration/test_batch_v4.py? Numpy is available as a test dependency |
To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge. |
agree with the CLA |
b196d53
to
02abfbd
Compare
Hi @dirkkul, I think should be fine now. The two previous tests which failed (DB issue and timeout on grpc) were not connected to this PR. |
802a74d
to
a2f116b
Compare
@dirkkul friendly reminder: can this branch be merged? |
a2f116b
to
5120b1a
Compare
integration/test_collection.py
Outdated
@@ -60,6 +61,15 @@ | |||
DATE3 = datetime.datetime.strptime("2019-06-10", "%Y-%m-%d").replace(tzinfo=datetime.timezone.utc) | |||
|
|||
|
|||
def get_numpy_vector(input_list: list) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have numpy in our dev requirements, so you can assume it is present for tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, simplified
(and rebased - hopefully that fixes the failing test)
@@ -57,7 +57,7 @@ def pack_vector(vector: Any) -> bytes: | |||
collection=obj.collection, | |||
vector_bytes=( | |||
pack_vector(obj.vector) | |||
if obj.vector is not None and isinstance(obj.vector, list) | |||
if obj.vector is not None and not isinstance(obj.vector, dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think handling of vector input should be at the input level, eg in insert_many and batch.add_objects and not here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate what you mean by this and some hints how to accomplish?
I was staring at the code and don't see how I could push the validation upstream, since obj.vector
can be both None
and a dict
. Couple of lines below where vectors
is constructed: if obj.vector is not None and isinstance(obj.vector, dict)
, which means either vector_bytes
or vectors
will be not-None, but not both at the same time.
For readability, we could move these two next to each other - just cosmetics.
Note: the only thing I would do is to move pack_vector
outside as is _pack_named_vectors
, and rename the latter to _pack_named_vector
. But that is again just cosmetics :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the correct place would be 'weaviate/collections/data/data.py::insert_manyand then the cleanup should be here
vector=obj.vector`. But keep in mind that vectors can be lists and dictionaries (for named vectors) and both should be supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dirkkul, thanks for coming back. However, I still don't see how this could be achieved without a major refactoring.
There are 3 possibilities in the current program flow, starting from data/_DataCollectionAsync.insert_many()
:
type(obj) | type(obj.vector) | _BatchObject.vector | vector_bytes | vectors |
---|---|---|---|---|
!DataObject | n/a | None | None | None |
DataObject | !dict | obj.vector | pack_vector(...) | None |
DataObject | dict | obj.vector | None | _pack_named_vectors(...) |
So move one needs to move some of the logic from __grpc_objects()
into insert_many()
, e.g. creating a list of tuples containing (_BatchObject, vector_bytes, vectors). Is this what you are suggesting or am I on the wrong path please?
Update: the last commit contains a first try at refactoring. Let me know your thoughts please.
e36d66c
to
3649fc5
Compare
3649fc5
to
9f21009
Compare
c564504
to
e98b0aa
Compare
Hey, I think you are mixing two things up:
We would want that user supplied vectors that are numpy arrays are converted to python lists. So if you do
it should take care of things. |
addbb98
to
00a60a8
Compare
00a60a8
to
27422ac
Compare
Hi @dirkkul, thanks for your patience, fingers crossed I got it now... |
Friendly ping @dirkkul |
Fixes #1002
This is the simplest fix which allows a 1d numpy array to be passed in (additionally, anything which can be converted in the function
util.get_vector
). This will raise aTypeError
if the input is incorrect, which is converted later toWeaviateInvalidInputError
.Note that passing in something else then
types.VECTOR
, e.g. a numpy array, will raise a mypy error.